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Abstract. Decision tree learning is a popular approach for classification and 
regression in machine learning and statistics, and Bayesian formulations — 
which introduce a prior distribution over decision trees, and formulate learning 
as posterior inference given data — have been shown to produce competitive 
performance. Unlike classic decision tree learning algorithms like ID3, C4.5 
and CART, which work in a top-down manner, existing Bayesian algorithms 
produce an approximation to the posterior distribution by evolving a complete 
tree (or collection thereof) iteratively via local Monte Carlo modifications to 
the structure of the tree, e.g., using Markov chain Monte Carlo (MCMC). We 
present a sequential Monte Carlo (SMC) algorithm that instead works in a 
top-down manner, mimicking the behavior and speed of classic algorithms. 
We demonstrate empirically that our approach delivers accuracy comparable 
to the most popular MCMC method, but operates more than an order of 
magnitude faster, and thus represents a better computation-accuracy tradeoff. 



1. Introduction 

Decision tree learning algorithms are widely used across statistics and machine 
learning, and often deliver near state-of-the-art performance despite their simplicity. 
Decision trees represent predictive models from an input space, typically R D , to an 
output space of labels, and work by specifying a hierarchical partition of the input 
space into blocks. Within each block of the input space, a simple model predicts 
labels. 

In classical decision tree learning, a decision tree (or collection thereof) is learned 
in a greedy, top-down manner from the examples. Examples of classical approaches 
that learn single trees include ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and CART 
(Breiman et al., 1984), while methods that learn combinations of decisions trees 
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include boosted decision trees (Friedman, 2001), Random Forests (Breiman, 2001), 
and many others. 

Bayesian decision tree methods, like those first proposed by Bvmtinc (1992), 
Chipman et al. (1998), Dcnison ct al. (1998), and Chipman and McCulloch (2000), 
and more recently revisited by Wu et al. (2007), Taddy et al. (2011) and Anag- 
nostopoulos and Gramacy (2012), cast the problem of decision tree learning into 
the framework of Bayesian inference. In particular, Bayesian approaches start by 
placing a prior distribution on the decision tree itself. To complete the specification 
of the model, it is common to associate each leaf node with a parameter indexing a 
family of likelihoods, e.g., the means of Gaussians or Bernoullis. The labels are then 
assumed to be conditionally independent draws from their respective likelihoods. 
The Bayesian approach has a number of useful properties: e.g., the posterior dis- 
tribution on the decision tree can be interpreted as reflecting residual uncertainty 
and can be used to produce point and interval estimates. 

On the other hand, exact posterior computation is typically infeasible and so 
existing approaches use approximate methods such as Markov chain Monte Carlo 
(MCMC) in the batch setting. Roughly speaking, these algorithms iteratively im- 
prove a complete decision tree by making a long sequence of random, local modifica- 
tions, each biased towards tree structures with higher posterior probability. These 
algorithms stand in marked contrast with classical decision tree learning algorithms 
like ID3 and C4.5, which rapidly build a decision tree for a data set in a top-down 
greedy fashion guided by heuristics. Given the success of these methods, one might 
ask whether they could be adapted to work in the Bayesian framework. 

In this article, we present such an adaptation, proposing a sequential Monte 
Carlo (SMC) method for approximate inference in Bayesian decision trees that 
works by sampling a collection of trees in a top-down manner like ID3 and C4.5. 
Unlike classical methods, there is no pruning stage after the top-down learning stage 
to prevent over-fitting, as the prior combines with the likelihood to automatically 
cut short the growth of the trees, and resampling focuses attention on those trees 
that better fit the data. In the end, the algorithm produces a collection of sampled 
trees that approximate the posterior distribution. While both existing MCMC 
algorithms and our novel SMC algorithm produce approximations to the posterior 
that are exact in the limit, we show empirically that our algorithms run more 
than an order of magnitude faster than existing methods while delivering the same 
predictive performance. 

The article is organized as follows: we begin by describing the Bayesian decision 
tree model precisely in Section 2, and then describe the SMC algorithm in detail in 
Section 3. Through a series of empirical tests, we demonstrate in Section 4 that this 
approach is fast and produces good approximations. We conclude in Section 5 with 
a discussion comparing this approach with existing ones in the Bayesian setting, 
and point towards future avenues. 



2. Model and notation 

In this section, we present the decision tree model for the distribution of the 
labels Y = {y n }n=i corresponding to input vectors X = {x n }^ =1 , x n £ R D . The 
assumption is that the probabilistic mapping from input vectors to their labels is 
mediated by a latent decision tree T that serves to partition the input space into 
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axis-aligned blocks. Each block is then associated with a parameter that determines 
the distribution of the labels of the input vectors falling in that block. 

A rooted, strictly binary tree T is a finite tree with a single root, denoted by the 
empty string e, where each internal node p except the root has exactly two children, 
called the left child pO and the right child pi. Denote the leaves of T (those nodes 
without children) by <9T. Each node of the tree p £ T is associated with a block 
B(p) C K D of the input space as follows: At the root we have B(e) = M. D , while each 
internal node p £ T \ dT "cuts" its block into two halves, with n(p) £ {1, . . . , D} 
denoting the dimension of the cut, and t(j>) denoting the location of the cut, so 
that 

B(pO) = B(p) n {z £ R D : z K(p) < rip)} and 

B(pl) = B{p) n {z £ R D : z K[p) > r{p)}. (1) 

We call the tuple T = (T, k, t) the decision tree. (See Figure 1 for more intuition on 
the representation and notation of decision trees.) Note that the blocks associated 
with the leaves of the tree partition M. D . It will be convenient to write N{p) for 
the set of data point indices n £ {1, . . . , N} such that x n £ B(p). For every subset 
A £ {1, . . . , N}, let Ya ■— {y n ■ n £ A} and similarly for A4, so that X N ^ are 
the input vectors in block Bip) and Y N ^ are their labels. Note that both B(p) and 
N(p) depend on T, although we have chosen to elide this dependence for notational 
simplicity. 

Conditioned on the examples X, we assume that the joint density f(Y,T\ X) of 
the labels Y and the latent decision tree T factorizes as follows: 

f{Y,T\X) = hiT\X)g{Y\T,X) 

= KT\X)Yl peaT t{Y N{p) \X N{p) ) (2) 

where i denotes a likelihood, defined below. 

In this paper, we focus on the case of categorical labels taking values in the 
set {1, . . . ,K}. It is natural to take I to be the Dirichlet-Multinomial likelihood, 
corresponding to the data being conditionally i.i.d. draws from a multinomial dis- 
tribution on {1, ... , K} with a Dirichlet prior. In particular, 

r(a) nf=ir(m pfc + f) 

where m p k denotes the number of labels y n = k among those n £ N(p) and a 
is the concentration parameter of the symmetric Dirichlet prior. Generalisations 
to other likelihood functions based on conjugate pairs of exponential families are 
straightforward. 

The final piece of the model is the prior density h{T\X) over decision trees. In 
order to make straightforward comparisons with existing algorithms, we adopt the 
model proposed by Chipman et al. (1998). In this model, the prior distribution of 
the latent tree is defined conditionally on the given input vectors X (see Section 5 
for a discussion of this dependence on X and its effect on the exchangeability of 
the labels). Informally, the tree is grown starting at the root, and each new node 
either splits and grows two children (turning the node into an internal node) or 
stops (leaving it a leaf) stochastically. 

We now describe the generative process more precisely in terms of a Markov 
chain capturing the construction of a decision tree in stages, beginning with the 
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Figure 1. A decision tree T = (T, k, t) represents a hierarchical 
partitioning of a space. Here, the space is the unit square and 
the tree T contains the nodes {e, 0, 1, 10, 11}. The root node e 
represents the whole space B(e) = M. D , while its two children 
and 1, represent the two halves of the cut (ft(e),r(e)) = (1,0.5), 
where = 1 represents the dimension of the cut, and r(e) = 0.5 
represents the location of the cut along that dimension. (The origin 
is at the bottom left of each figure, and the x-axis is dimension 1. 
The red stars and blue circles represent observed data points.) The 
second cut, (k(1),t(1)) = (2,0.35), splits the block B(l) into the 
two halves -B(ll) and B(10). When defining the prior over decision 
trees given by Chipman et al. (1998), it will be necessary to refer to 
the "extent" of the data in a block. E.g., and 7° are the extent 
of the data in dimensions 1 and 2, respectively, in block B(Q). For 
each node p, the set D p contains those dimensions with non-trivial 
extent. Here, D° — {1,2}, but D 10 = {2}, because there is no 
variation in dimension 1. 
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trivial tree To = {e} containing only the root node. At each stage i, Tj is produced 
from Tj_i by choosing one leaf in Tj_i and either growing two children nodes or 
stopping the leaf. Once stopped, a leaf is ineligible for future growth. The identity 
of the chosen leaf is deterministic, while the choice to grow or stop is stochastic. 
The process proceeds until all leaves are stopped, and so each node is considered 
for expansion exactly once throughout the process. This will be seen to give rise to 
a finite sequence of decision trees % = (Tj, «;,,Tj) once we define the associated cut 
functions Ki and Tj. We will use this Markov chain in Section 3 as scaffolding for 
a sequential Monte Carlo algorithm. A similar approach was employed by Taddy 
et al. (2011) in the setting of online Bayesian decision trees. There are similarities 
also with the bottom-up SMC algorithms by Teh et al. (2008) and Bouchard-Cote 
et al. (2012). 

We next describe the rule for stopping or growing nodes, and the distribution of 
cuts. Let p be the node chosen at some stage of the generative process. If the input 
vectors A/v( p ) are all identical, then the node stops and becomes a leaf. (Chipman 
et al. chose this rule because no choice of cut to the block B(p) would result in 
both children containing at least one input vector.) Otherwise, let D p be the set of 
dimensions along which Ajy( p ) varies, and let 1% = [min ne jv( p ) x n d, max„ e jv(p) x-nd] 
be the range of the input vectors along dimension d £ D p . (See last subfigure of 
Figure 1.) Under the Chipman et al. model, the probability that node p is split is 

h+Ly. ' a. G (0,1), AG [0.OO), (4) 

where \p\ is the depth of the node, and a s and j3 s are parameters governing the 
shape of the resulting tree. For larger a s and smaller f3 s the typical trees are larger, 
while the deeper p is in the tree the less likely it will be cut. If p is cut, the 
dimension n(p) and then location r(p) of the cut are sampled uniformly from D p 
and I^fp) ~ respectively Note that the choice for the support of the distribution over 
cut dimensions and locations are such that both children of p will, with probability 
one, contain at least one input vector. Finally, the choices of whether to grow or 
stop, as well the cut dimensions and locations, are conditionally independent across 
different subtrees. 

To complete the generative model, we define T = T n , k — k^ and r = T n , where 
r] is the first stage such that all nodes are stopped. We note that n < 2N with 
probability one because each cut of a node p produces a non-trivial partition of the 
data in the block, and a node with one data point will be stopped instead of cut. 
The conditional density of the decision tree T — (T, k, t) can now be expressed as 

Note that the prior distribution of T does not depend on the deterministic rule 
for choosing a leaf at each stage. However this choice will have an effect on the 
bias/variance of the corresponding SMC algorithm. 

3. Sequential Monte Carlo (SMC) for Bayesian decision trees 

In this section we describe an SMC algorithm for approximating the posterior 
distribution over the decision tree (T, k, t) given the labeled training data (X,Y). 
(We refer the reader to (Cappe et al., 2007) for an excellent overview of SMC tech- 
niques.) The approach we will take is to perform particle filtering following the 
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sequential description of the prior. In particular, at stage i, the particles approx- 
imate a modified posterior distribution where the prior on (T, k, t) is replaced by 
the distribution of (Tj, Ki,Ti), i.e., the process truncated at stage i. 

Let Ei denote the set of unstopped leaves at stage i, all of which are eligible for 
expansion. An important freedom we have in our SMC algorithm is the choice of 
which candidate leaf (or set Cj C Ei of candidate leaves) to consider expanding. In 
order to avoid "multipath" issues (Del Moral et al., 2006, §3.5) which lead to high 
variance, we fix a deterministic rule for choosing Cj C Ei. (Multiple candidates are 
expanded or stopped in turn, independently.) This rule can be a function of (X, Y) 
and the state of the current particle, as the correctness of resulting approximation 
is unaffected. We evaluate two choices in experiments: first, the rule Cj = Ei where 
we consider expanding all eligible nodes; and second, the rule where Cj contains a 
single node chosen in a breadth-first (i.e., oldest first) manner from £j. 

We may now define the sequence (P^) of target distributions. Recall the 
sequential process defined in Section 2. If the generative process for the decision tree 
has not completed by stage i, the process has generated (Tj,Kj,Tj) along with Ei, 
capturing which leaves in Tj have been considered for expansion in previous stages 
already and which have not. Let 7j = (Tj, Kj, Tj, Ei) be the variables generated on 
stage i, and write P for the prior distribution on the sequence (71). We construct the 
target distribution Vj as follows: Given %, we generate labels Y' with likelihood 
g(Y'\Ti, X), i.e., as if (Tj, ftj, Tj) were the complete decision tree. We then define P^ 
to be the conditional distribution of % given Y' = Y. That is, Pj is the posterior 
with a truncated prior. 

In order to complete the description of our SMC method, we must define pro- 
posal kernels (Qj) that sample approximations for the ith stage given values for 
the (i — l)th stage. As with our choice of Ci, we have quite a bit of freedom. 
In particular, the proposals can depend on the training data (X, Y). An obvious 
choice is to take Qi to be the conditional distribution of % given 7j-i under the 
prior, i.e., setting Qi(7i|7;~i) = P(7j|7;-i). Informally, this choice would lead 
us to propose extensions to trees at each stage of the algorithm by sampling from 
the prior, so we will refer to this as the prior proposal kernel (aka the Bayesian 
bootstrap filter (Gordon et al., 1993)). 

We consider two additional proposal kernels: The first, 

Qi(Ti\Ti-i)=PY(n\%-i), (6) 

is called the (one-step) optimal proposal kernel because it would be the optimal 
kernel assuming that the ith stage were the final stage. We return to discuss 
this kernel in Section 3.1. The second alternative, which we will refer to as the 
empirical proposal kernel, is a small modification to the prior proposal, differing 
only in the choice of the split point r. Recall that, in the prior, r^(p) is chosen 
uniformly from the interval I^.( p y This ignores the empirical distribution given by 
the input data X N ^ in the partition. We can account for this by first choosing, 
uniformly at random, a pair of adjacent data points along feature dimension Ki(p), 
and then sampling a cut Ti(jp) uniformly from the interval between these two data 
points. 

The pseudocode for our proposed SMC algorithm is given in Algorithm 1. Note 
that the SMC framework only requires us to compute the density of 71 under the 
target distribution up to a normalization constant. (In fact, the SMC algorithm 
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Algorithm 1 SMC for Bayesian decision tree learning 

Inputs: Training data (X,Y) 

Number of particles M 

Initialize: T Q m) = E ( Q m) = {e} 
T (m) = K (m) = 



(m) 



W = Em W 



for i = 1 : MAX-STAGES do 
for m = 1 : M do 

Sample 7~ (m) from | l£*>) 

where T} m) := (T< m) , K < m > , r< m) , E\ m) ) 

Update weights: (Here P, denote their densities.) 



nTj m) )g{Y\T} m \x) 



w [ m) = — f^f (7) 

(8) 



(m) 

®i(rf m) \T^)9(Y\7<^,X) 
end for 

Compute normalization: Wi = E m wf 1 ^ 

Normalize weights: (Vm) w,' m ' = w[ m ^ /Wi 

if (E m (^ (m) ) 2 ) _1 < ESS-THRESHOLD then 

(m')r . 



(Vm) Resample indices j m from 53 m ' ^ 
(Vm) Tt l) <- 7^ (im) ; W | m) Wi/M 
end if 

if (Vm) = then 

exit for loop 
end if 
end for 

return Estimated marginal probability Wi/M and 

weighted samples {w^ , , K,\ m ' , }^ =1 . 



produces an estimate of the normalization constant, which, at the end of the algo- 
rithm, is equal to the marginal probability of the labels Y given X , with the latent 
decision tree T marginalized out.) In general, the joint density of a Markov chain 
can be hard to compute, but because the set of nodes Cj considered at each stage 
is a deterministic function of 7i, the path (7o,7i, . . . ,7i-i) taken is a deterministic 
function of %. As a result, the joint density is simply a product of probabilities 
for each stage. The same property holds for the proposal kernels defined above be- 
cause they use the same candidate set Cj, and have the same support as P. These 
properties justify equations (7) and (8). 
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3.1. The one-step optimal proposal kernel. In this section we revisit the def- 
inition of the one-step optimal proposal kernel. While the prior and empirical 
proposal kernels are relatively straightforward, the one-step optimal proposal ker- 
nel is defined in terms of an additional conditioning on the labels Y , which we now 
study in greater detail. 

Recall that the one-step optimal proposal kernel Qj is given by Qi(% \ %-i) = 
^i(Ti\Ti-i). To begin, we note that, conditionally on 77— 1 and Y, the subtrees 
rooted at each node p <E Cj_i are independent. This follows from the fact that the 
likelihood of Y given % factorizes over the leaves. Thus, the proposal's probability 
density is 

Qi(7I|7I_i)= J] Q i (Pi,p,Ki(p),T i (p)), (9) 

where Qi is the probability density of the cuts at node p under Q,-, and pi tP denotes 
whether the node was split or not. On the event we split a node p € C-t-i, if 
we condition further on Ki(p) and pi :P , we note that the conditional likelihood of 
Yjv(p)> when viewed as a function of the split Tj(p), is piecewise constant, and in 
particular, only changes when the split crosses an example. It follows that we can 
sample from this proposal by first considering the discrete choice of an interval, and 
then sampling uniformly at random from within the interval, as with the empirical 
proposal. Some algebra shows that 

Qt(Pr,p = Stop) OC (l - ^ + "p|)/3 s ) Z( Y N(p)\X N{p) ) , (10) 

and 

Qi{Pi,p = Split, Ki(p),Ti(p)) OC ^ + ^|^ 3 jjp 1 i n K Y N(pj)\X N (pj))- (11) 

3.2. Computational complexity. Let Ud denote the number of unique values 
in dimension d, N p denote the number of training data points at node p and 
rf- m > denote the number of nodes in particle m. For all the SMC algorithms, the 
space complexity is 0{MN) + 0(J2 d u d) + C(J2 m V (m) )- Tne time complexity 
is 0(DN log N) + M'^2 p O(2DN p + N p ) for prior and empirical proposals and 
A/^ p (-D0(iVplog N p ) + N p ) for the optimal proposal. The optimal proposal typi- 
cally requires higher computational cost per particle, but fewer number of particles 
than the prior and empirical proposals. 

4. Experiments 

In this section, we experimentally evaluate the design choices of the SMC al- 
gorithm (proposal, expansion strategy, number of particles and "islands") on real 
world datasets. In addition, we compare the performance of SMC to the most pop- 
ular MCMC method for Bayesian decision tree learning (Chipman et al., 1998), as 
well as CART, a popular (non-Bayesian) tree induction algorithm. We evaluate all 
the algorithms on the following datasets from the UCI ML repository (Asuncion 
and Newman, 2007): 

• MAGIC gamma telescope data 2004 (magic-04): N = 19020, D = 10, 
K = 2. 

• Pen-based recognition of handwritten digits {pen- digits): N = 10992, D = 
16, K = 10. 
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Previous work has focused mainly on small datasets (e.g., the Wisconsin breast 
cancer database used by Chipman et al. (1998) has 683 data points). We chose 
the above datasets to illustrate the scalability of our approach. For the pen-digits 
dataset, we used the predefined training/test splits, while for the other datasets, 
we split the datasets randomly into a training set and a test set containing approx- 
imately 70% and 30% of the data points respectively. 

We implemented our scripts in Python and applied similar software optimization 
techniques to SMC and MCMC scripts 1 . Our experiments were run on a cluster 
with machines of similar processing power. 

4.1. Design choices in the SMC algorithm. In these set of experiments, we 
fix the hyperparamctcrs to a — 5.0, a s = 0.95, /3 S — 0.5 and compare the predictive 
performance of different configurations of the SMC algorithm for this fixed model. 
Under the prior, these values of a s , /3 S produce trees whose mean depth and number 
of nodes are 5.1 and 18.5, respectively. Given M particles, we use an effective 
sample size (ESS) threshold of M/10 and set the maximum number of stages to 
5000 (although the algorithms never reached this number). 

4.1.1. Proposal choice and node expansion. We consider the SMC algorithm pro- 
posed in Section 3 under two proposals: optimal and prior. (The empirical pro- 
posal performed similar to the prior proposal and hence we do not report those 
results here.) We consider two strategies for choosing Ci, i.e., the list of nodes 
considered for expansion at stage i: (i) node-wise expansion, where a single node 
is considered for expansion per stage (i.e., Ci is a singleton chosen detcrministi- 
cally from eligible nodes Ei), and (ii) layer-wise expansion, where all nodes at a 
particular depth are considered for expansion simultaneously (i.e., Ci = Ei). For 
node-wise expansion, we evaluate two strategies for selecting the node determinis- 
tically from Ci'. (i) breadth-first priority, where the oldest node is picked first, and 
(ii) marginal-likelihood based priority, where we expand the node with the lowest 
marginal likelihood. Both of these priority schemes performed similarly; hence we 
report only the results for breadth-first priority. We use multinomial resampling in 
our experiments. We also evaluated systematic resampling (Douc et al., 2005) but 
found that the performance was not significantly different. 

We report the log predictive probability on test data as a function of runtime 
and of the number of particles (similar trends are observed for test accuracy; see 
Appendix A). The times reported do not account for prediction time. We average 
the numbers over 10 random initializations and report standard deviations. The 
results are shown in Figure 2. In summary, we observe the following: 
(1) node-wise expansion outperforms layer-wise expansion for prior proposal. The 
prior proposal does not account for likelihood; one could think of the resam- 
pling steps as 'correction steps' for the sub-optimal decisions sampled from the 
prior proposal. Because node-wise expansion can potentially resample at every 
stage, it can correct individual bad decisions immediately, whereas layer-wise 
expansion cannot. In particular, we have observed that layer-wise expansion 
tends to produce shallower trees compared to node-wise expansion, leading to 
poorer performance. This phenomenon can be explained as follows: as the 
depth of the node increases, the prior probability of stopping increases whereas 



The scripts can be downloaded from the authors' webpages. 
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the posterior probability of stopping might be quite low. In node-wise expan- 
sion, the resampling step can potentially retain the particles where the node has 
not been stopped. However, in layer-wise expansion, too many nodes might 
have stopped prematurely and the resampling step cannot 'correct' all these 
bad decisions easily (i.e., it would require many more particles to sample trees 
where all the nodes in a layer have not been stopped). Another interesting 
observation is that layer-wise expansion exhibits higher variance: this can be 
explained by the fact that layer-wise expansion samples a greater number of 
random variables (on average) than node-wise before resampling, and so suffers 
for the same reason that importance sampling can suffer from high variance. 
Note that both expansion strategies perform similarly for the optimal proposal 
due to the fact that the proposal is already optimal (i.e. it accounts for the 
likelihood) and resampling does not affect the results significantly. Due to its 
superior performance, we consider only node-wise expansion in the rest of the 
paper. 

(2) The plots on the right side of Figure 2 suggest that the optimal proposal requires 
fewer particles than the prior proposal (as expected). However, the per-stage 
cost of optimal proposal is much higher than the prior, leading to significant 
increase in the overall runtime (sec Section 3.2 for a related discussion). Hence, 
the prior proposal offers a better predictive performance vs computation time 
tradeoff than the optimal proposal. 

(3) The performance of optimal proposal saturates very quickly and is near-optimal 
even when the number of particles is small (M = 10). 

4.1.2. Effect of irrelevant features. In the next experiment, we test the effect of 
irrelevant features on the performance of the various proposals. We use the madelon 
dataset 2 for this experiment, in which the data points belong to one of 2 classes and 
lie in a 500-dimensional space, out of which only 20 dimensions are deemed relevant. 
The training dataset contains 2000 data points and the test dataset contains 600 
data points. We use the validation dataset in the UCI ML repository as our test 
set because labels are not available for the test dataset. 

The setup is identical to the previous section. The results are shown in Figure 
3. Here, the optimal proposal outperforms the prior proposal in both the columns, 
requiring fewer particles as well as outperforming the prior proposal for a given 
computational budget. While this dataset is atypical (only 4% of the features are 
relevant), it illustrates a potential vulnerability of the prior proposal to irrelevant 
features. 

4.1.3. Effect of the number of islands. Averaging the results of several independent 
particle filters (aka islands) is a way to reduce variance at the cost of bias, compared 
with running a single, larger filter. In the asymptotic regime, this would not make 
sense, but as we will see, performance is improved with multiple islands, suggesting 
we are not yet in the asymptotic regime. In this experiment, we evaluate the 
effect of the number of islands on the test performance of the prior proposal. We 
fix the total number of particles to 2000 and vary /, the number of islands (and 
hence, the number of particles per island). Note that all the islands operate on 
the entire dataset unlike bagging. Here, we present results only on the pen-digits 



'http : //archive . ics . uci . edu/ml/datasets/Madelon 
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Figure 2. Results on pen-digits (top), and magic-04 (bottom). 
Left column plots test logp(y|x) vs runtime, while right column 
plots test \ogp(y\x) vs number of particles. The blue circles and red 
squares represent optimal and prior proposals respectively. The 
solid and dashed lines represent node-wise and layer-wise proposals 
respectively. 

dataset (see Appendix B for results on the magic-04 dataset). The results are 
shown in Figure 4. We observe that (i) the test performance drops sharply if we 
use fewer than 100 particles per island and (ii) when M/I > 100, the choices of 
/ G [5, 100] outperform 1 = 1. Since the islands are independent, the computation 
across islands is 'embarrassingly parallelizable'. 

4.2. SMC vs MCMC. In this experiment, we compare the SMC algorithms to 
the MCMC algorithm proposed by Chipman ct al. (1998), which employs four 
types of Metropolis-Hastings proposals: grow (split a leaf node into child nodes), 
prune (prune a pair of leaf nodes belonging to the same parent), change (change 
the decision rule at a node) and swap (swap the decision rule of a parent with the 
decision rule of the child). In our experiments, we average the MCMC predictions 
over the trees from all previous iterations. 

The experimental setup is identical to Section 4.1, except that we fix the number 
of islands, 1 = 5. We vary the number of particles for SMC* ! and the number of 



We fix / = 5 so that the minimum value of M (= 100) corresponds to M/I = 20 particles 
per island. Further improvements could be obtained by 'adapting' / to M as discussed in Section 
4.1.3. 
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Figure 3. Results on madelon dataset: The top and bottom rows 
display log p(y\x) and accuracy on the test data against runtime 
(left) and the number of particles (right) respectively. The blue 
circles and red squares represent optimal and prior proposals re- 
spectively. 




Number of islands Number of islands 



Figure 4. Results on pen-digits: Test logp(y\x) (left) and accu- 
racy (right) vs I and M/I for fixed M = 2000. 

iterations for MCMC and plot the log predictive probability and accuracy on the 
test data as a function of runtime. In Figure 5, we observe that SMC (prior, node- 
wise) is roughly two orders of magnitude faster than MCMC while achieving similar 
predictive performance on pen-digits and magic-04 datasets. Although the exact 
speedup factor depends on the dataset in general, we have observed that SMC 



TOP-DOWN PARTICLE FILTERING FOR BAYESIAN DECISION TREES 



13 




Figure 5. Results on pen-digits (top row), and magic-04 (bottom 
row). Left column plots test log p(y\x) vs runtime, while right col- 
umn plots test accuracy vs runtime. The blue cirlces, red squares 
and black diamonds represent optimal, prior proposals and MCMC 
respectively. 

(prior, node-wise) is at least an order of magnitude faster than MCMC. 

The SMC runtimes in Figure 5 are recorded by running the / islands in a serial 
fashion. As discussed in Section 4.1.3, one could parallelize the computation leading 
to an additional speedup by a factor of /. In the pen-digits dataset, the performance 
of prior proposal seems to drop as we increase M beyond 2000. However, the 
marginal likelihood on the training data increases with M (see Appendix C). We 
believe that the deteriorating performance is due to model misspecification (axis- 
aligned decision trees are hardly the 'right' model for handwritten digits) rather 
than the inference algorithm itself: 'better' Bayesian inference in a misspecihed 
model might lead to a poorer solution (see (Minka, 2000) for a related discussion). 

To evaluate the sensitivity of the trends above to the hyper parameters a, a s , /3 S , 
we systematically varied the values of these hyper parameters and repeated the 
experiment. The results are qualitatively similar. See Appendix D for additional 
information. 

4.3. SMC vs other existing approaches. The goal of these experiments was 
to verify that our SMC approximation performed as well as the "gold standard" 
MCMC algorithms most commonly used in the Bayesian decision tree learning set- 
ting. Indeed, our results suggest that, for a fraction of the computational budget, 
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we can achieve a comparable level of accuracy. In this final experiment, we re- 
affirm that the Bayesian algorithms are competitive in accuracy with the classic 
CART algorithm. (There are many other comparisons that one could pursue and 
other authors have already performed such comparisons. E.g., Taddy ct al. (2011) 
demonstrated that their tree structured models yield similar performance as Gauss- 
ian processes and random forests.) We used the CART implementation provided 
by scikit-learn (Pedregosa et al., 2011) with two criteria: gini purity and infor- 
mation gain and set min_samples_leaf = 10 (minimum number of data points at 
a leaf node). 4 In addition, we performed Laplacian smoothing on the probability 
estimates from CART using the same a as for the Bayesian methods. Our Python 
implementation of SMC takes about 50-100x longer to achieve the same test ac- 
curacy as the highly-optimized implementation of CART. For this reason, we plot 
CART accuracy as a horizontal bar. The accuracy and log predictive probability 
on test data are shown in Figure 5. The Bayesian decision tree frameworks achieve 
similar (or better) test accuracy to CART, and outperform CART significantly in 
terms of the predictive likelihood. SMC delivers the benefits of having an approxi- 
mation to the posterior, but in a fraction of the time required by existing MCMC 
methods. 

5. Discussion and Future work 

We have proposed a novel class of Bayesian inference algorithms for decision 
trees, based on the sequential Monte Carlo framework. The algorithms mimic clas- 
sic top-down algorithms for learning decision trees, but use "local" likelihoods along 
with resampling steps to guide tree growth. We have shown good computational 
and statistical performances, especially compared with a state-of-the-art MCMC 
inference algorithm. Our algorithms are easier to implement than their MCMC 
counterparts, whose efficient implementations require sophisticated book-keeping. 

We have also explored various design choices leading to different SMC algo- 
rithms. We have found that expanding too many nodes simultaneously degraded 
performance, and more sophisticated ways of choosing nodes surprisingly did not 
improve performance. Finally, while the one-step optimal proposal often required 
fewer particles to achieve a given accuracy, it was significantly more computation- 
ally intensive than the prior proposal, leading to a less efficient algorithm overall on 
datasets with few irrelevant input dimensions. As the number of irrelevant dimen- 
sions increased the balance tipped in favour of the optimal proposal. An interesting 
direction of exploration is to devise some way to interpolate between the prior and 
optimal proposals, getting the best of both worlds. 

The model underlying this work assumes that the data is explained by a single 
tree. In contrast, many uses of decision trees, e.g., random forests, bagging, etc., 
can be interpreted as working within a model class where the data is explained by 
a collection of trees. Bayesian additive regression trees (BART) (Chipman et al., 
2010) are such a model class. Prior work has considered MCMC techniques for 
posterior inference (Chipman ct al., 2010). A significant but important extension 
of this work would be to tackle additive combinations of trees, potentially in a way 
that continues to mimic classic algorithms. 



Lower values (min_samples_leaf = 1, 5) tend to yield slightly higher test accuracies (compa- 
rable to SMC and MCMC) but much lower predictive probabilities. 
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Finally, in order to more closely match existing work in Bayesian decision trees, 
we have used a prior over decision trees that depends on the input data X. This 
has the undesirable side-effect of breaking exchangeability in the model, making it 
incoherent with respect to changing dataset sizes and to working with online data 
streams. One solution is to use an alternative prior for decision trees, e.g., based on 
the Mondrian process (Roy and Teh, 2009), whose projectivity would re-establish 
exchangeability while allowing for efficient posterior computations that depend on 
data. 
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Appendix A. Effect of SMC proposal and expansion strategy on test 

ACCURACY 

The results are shown in Figure 6. 
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Figure 6. Results on pen-digits (top), and magic-04 (bottom). 
Left column plots test accuracy vs runtime, while right column 
plots test accuracy vs number of particles. The blue circles and red 
squares represent optimal and prior proposals respectively. The 
solid and dashed lines represent node-wise and layer-wise proposals 
respectively. 



Appendix B. Effect of the number of islands: magic-04 dataset 
The results are shown in Figure 7. 

Appendix C. Marginal likelihood 

The log marginal likelihood of the training data for different proposals is shown 
in Figure 8. As the number of particles increases, the log marginal likelihood of 
prior and optimal proposals converge to the same value (as expected). 

Appendix D. Sensitivity of results to choice of hyperparameters 

In this experiment, we evaluate the sensitivity of the runtime vs predictive per- 
formance comparison between SMC (prior and optimal proposals), MCMC and 
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Number of islands Number of islands 

Figure 7. Results on magic-04- Test logp(y|x) (left) and accu- 
racy (right) vs I and M/I for fixed M = 2000. 




J 10 2 10 3 10 4 J 10 2 10 3 10" 

Number of particles Number of particles 



Figure 8. Results on pen-digits (left), and magic-04 (right). 
Mean log marginal likelihood (i.e., mean \ogp(Y\X) for training 
data averaged across 10 runs) vs number of particles. The blue 
circles and red squares represent optimal and prior proposals re- 
spectively. 

CART to the choice of hyper parameters a (Dirichlet concentration parameter) 
and a Sl f3 s (tree priors). We consider only node-wise expansion since it consistently 
outperformed layer-wise expansion in our previous experiments. In the first vari- 
ant, we fix a — 5.0 (since we do not expect it to affect the timing results) and 
vary the hyper parameters from a s — 0.95, j3 s = 0.5 to a s — 0.8, f3 s — 0.2 (bold 
reflects changes) and also consider intermediate configurations a s — 0.95, f3 s — 0.2 
and oc s — 0.8, (3 S — 0.5. In the second variant, we fix a s — 0.95, /3 S — 0.5 and set 
a — 1.0. Figures 9, 10, 11 and 12 display the results on pen-digits (top row), and 
magic-04 (bottom row). The left column plots test logp(y|a;) vs runtime, while 
the right column plots test accuracy vs runtime. The blue circles and red squares 
represent optimal and prior proposals respectively. Comparing the results to Fig- 
ure 5, we observe that the trends are qualitatively similar to those observed for 
a = 5.0, a s = 0.95, f3 s — 0.5 in Section 4.2: (i) SMC consistently offers a better 
runtime vs predictive performance tradeoff than MCMC, (ii) the prior proposal of- 
fers a better runtime vs predictive performance tradeoff than the optimal proposal, 
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(iii) a = 1.0 leads to similar test accuracies as a 
are obviously not comparable). 
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FIGURE 9. Hyperparameters: a — 5.0, a s — 0.8, j3 a = 0.5 
(see main text for additional information). 
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Figure 10. Hyperparameters: a — 5.0, a s = 0.95, (3 g = 0.2 
(see main text for additional information). 
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Figure 11. Hyperparameters: a = 5.0, a s — 0.8, (3 S — 0.2 (see 
main text for additional information). 
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Figure 12. Hyperparameters: a. — 1.0, a s — 0.95, /3 S = 0.5 (see 
main text for additional information). 



